Characterizing Uncertain Data using Compression
Authors
Abstract
Motivated by applications in sensor networks, mobility data, biology, and the life sciences, the area of mining uncertain data has recently received a great deal of attention. While various papers have focused on efficiently mining frequent patterns from uncertain data, the problem of discovering a small set of interesting patterns that provide an accurate and condensed description of a probabilistic database is still unexplored. In this paper we study the problem of discovering characteristic patterns in uncertain data through an information-theoretic lens. Adopting the possible worlds interpretation of probabilistic data and a compression scheme based on the MDL principle, we formalize the problem of mining patterns that compress the database well in expectation. We show that, despite its huge search space, this problem can be accurately approximated. In particular, we devise a sequence of three methods where each new method improves the memory requirements by orders of magnitude compared to its predecessor, while giving up only a little in terms of approximation accuracy. We empirically compare our methods on both synthetic data and real data from the life sciences. Results show that from a probabilistic matrix with more than one million rows and columns, we can extract a small set of meaningful patterns that accurately characterize the data distribution of any probable world.
Similar resources
Modeling of Compression Curves of Flexible Polyurethane Foam with Variable Density, Chemical Formulations and Strain Rates
Flexible Polyurethane (PU) foam samples with different densities and chemical formulations were tested in quasi-static stress-strain compression tests. The compression tests were performed using the Lloyd LR5K Plus instrument at a fixed compression strain rate of 0.033 s⁻¹, and samples were compressed up to 70% compression strain. All foam samples were tested in the foam rise direction and their ...
Representing Data without Lost Compression Properties in Time Series: A Review
Uncertain data is believed to be an important issue in building a prediction model. The main objective in time series uncertainty analysis is to formulate uncertain data in order to gain knowledge and fit a low-dimensional model prior to a prediction task. This paper discusses the performance of a number of techniques for dealing with uncertain data, specifically those which solve uncertain ...
Mining Frequent Patterns in Uncertain and Relational Data Streams using the Landmark Windows
Today, in many modern applications, we search for frequent and repeating patterns in the analyzed data sets. In this search, we look for patterns that frequently appear in the data set and mark them as frequent patterns to enable users to make decisions based on these discoveries. Most algorithms presented in the context of data stream mining and frequent pattern detection work either on uncertai...
A Probabilistic Model to Characterize the Uncertainty of Web Data Integration: What Sources Have The Good Data?
There is a large amount of data that is published on the web. Several techniques have been developed to extract and integrate data from Web sources. However, Web data is inherently imprecise and uncertain. Novel approaches to deal with the uncertain data have been recently proposed. However, they assume an uncertain degree is already associated with the data. This paper addresses the issue of c...
Exergy and Energy Analysis of Diesel Engine using Karanja Methyl Ester under Varying Compression Ratio
The need to decrease the consumption of conventional fuel and related energy, and to promote the use of renewable sources such as biofuels, demands effective evaluation of engine performance based on the laws of thermodynamics. Energy, exergy, entropy generation, mean gas temperature and exhaust gas temperature analysis of CI engine using diesel and karanja methyl ester blends at d...